In this project, I will be working with a data set of the 1,000 highest-rated movies on the IMDb (Internet Movie Database) website. Using the information in this data set, I will build and test a model that predicts a film's gross revenue from its other attributes. I will also conduct inference on which factors are actually useful in predicting gross, and on how some factors may be correlated with others.
The data set “IMDB Movies Dataset” includes 1000 observations with 16 variables each. Some of the variables, like Poster_Link, which is simply the URL of the movie poster, are not relevant to our model, so they will be discarded during data cleaning. The remaining variables are described below.
Series_Title is where the name of each movie is stored.
Released_Year is the year in which the movie was first released.
Certificate is the rating certificate that the movie is classified as.
Runtime is the duration of the movie in minutes.
IMDB_Rating is the rating out of 10 that the movie received on the IMDb website.
No_of_Votes is the number of people who ‘voted’ for a film, or assigned it a rating out of 10, on the IMDb website.
Meta_score is the rating of the film out of 100, as calculated from the average of a large group of respected critics’ reviews.
Director is the name of the movie’s director.
Gross is the total amount of money earned by the movie (our outcome variable.)
Star1, Star2, Star3 and Star4 are the names of the four leading actors in the movie.
#load the packages used throughout this report, then read in the data
library(tidymodels)
library(janitor)
library(corrplot)
tidymodels_prefer()
setwd("/Users/kevin/PSTAT126/PSTAT131/PSTAT131-FinalProject/IMDB_topmovies_csv")
IMDB <- read.csv('IMDB-csvdata.csv')
To make sure our data is tidy, we first use the function clean_names() to ensure the variable names of the data set are unique and contain no spaces. This data set is already well-organized, so the only real effect of the function is to standardize the names to lowercase.
IMDB <- clean_names(IMDB)
names(IMDB)
## [1] "poster_link" "series_title" "released_year" "certificate"
## [5] "runtime" "genre" "imdb_rating" "overview"
## [9] "meta_score" "director" "star1" "star2"
## [13] "star3" "star4" "no_of_votes" "gross"
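As a toy illustration of the renaming, here is a hypothetical base-R stand-in (the helper fix_names is mine, not part of janitor) for the behavior of clean_names() we rely on here:

```r
# Hypothetical base-R stand-in for clean_names(): lowercase everything
# and collapse runs of non-alphanumeric characters into underscores.
fix_names <- function(x) tolower(gsub("[^A-Za-z0-9]+", "_", x))

fix_names(c("Poster_Link", "Series Title", "IMDB_Rating"))
# [1] "poster_link"  "series_title" "imdb_rating"
```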
Next, we deselect irrelevant variables, such as poster_link and overview.
IMDB <- IMDB %>%
select(-poster_link,-overview)
Now our data set only includes 14 variables for each observation.
Since we are building a supervised machine learning model to predict gross revenue, we will remove any observations with missing values of gross (without an observed outcome, these rows cannot be used to fit the model or to check the accuracy of its predictions).
#subset() keeps only the rows where gross > 0; rows with missing gross
#evaluate to NA in the condition and are dropped as well
IMDB <- subset(IMDB, gross > 0)
Now our data set is made up of 831 observations, with 14 features each.
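A quick base-R illustration (on toy data, not the real data set) of why the positive-value filter also drops missing values:

```r
# subset() keeps rows where the condition is TRUE; an NA condition
# counts as not-TRUE, so rows with missing gross are dropped too.
toy <- data.frame(title = c("A", "B", "C"), gross = c(1e8, NA, 5e7))
subset(toy, gross > 0)
#   title gross
# 1     A 1e+08
# 3     C 5e+07
```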
Before we do any manipulation on the features of the data set, we’ll use summary(is.na()) to take a look at where other missing values might be located.
summary(is.na(IMDB))
## series_title released_year certificate runtime
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:831 FALSE:831 FALSE:831 FALSE:831
##
## genre imdb_rating meta_score director
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:831 FALSE:831 FALSE:750 FALSE:831
## TRUE :81
## star1 star2 star3 star4
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:831 FALSE:831 FALSE:831 FALSE:831
##
## no_of_votes gross
## Mode :logical Mode :logical
## FALSE:831 FALSE:831
##
At this point, the only remaining feature with missing values is meta_score. To make sure our model runs smoothly, we will deselect this feature from the data set, and let the rating of the movie be represented instead by imdb_rating.
IMDB <- IMDB %>%
select(-meta_score)
Some of our features which should be quantitative are stored non-numerically, so our next step in data cleaning is to convert the data for variables runtime, no_of_votes, imdb_rating, and released_year into integers/numeric data.
Runtime is stored in the character format ‘XXX min’, so we will take the first three characters (for runtimes under 100 minutes this includes a trailing space, which as.integer() ignores) and convert the result to an integer value.
IMDB <- IMDB %>%
mutate(
runtime = substr(runtime,1,3),
runtime = as.integer(runtime),
imdb_rating = as.numeric(imdb_rating),
no_of_votes = as.numeric(no_of_votes)
)
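A quick check of the parsing trick on toy values, including the sub-100-minute edge case:

```r
# substr() grabs the leading digits; as.integer() tolerates the
# trailing space left behind for runtimes under 100 minutes.
as.integer(substr(c("142 min", "99 min"), 1, 3))
# [1] 142  99
```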
Before we can convert released_year to an integer, we must fix a small error in one of the observations, where the value of released_year was mistakenly entered as the movie certificate ‘PG’. Instead of removing the observation, I will manually enter the movie’s release year so that it can still be used to develop and test our model.
#First we will replace the entry of Released_Year that equals "PG" with the actual year
#The movie Apollo 13 was released in 1995
IMDB$released_year <- replace(IMDB$released_year, IMDB$released_year == "PG",'1995')
#Now we can convert all the entries in Released_Year to integers
IMDB <- IMDB %>%
mutate(
released_year = as.integer(released_year)
)
In order to create movie categories by director, certificate, and star, we will convert the categorical variables to factors. We will include only star1, since each star variable has hundreds of distinct levels when factored.
IMDB <- IMDB %>%
mutate(
director = factor(director),
star1 = factor(star1),
certificate = factor(certificate)
)
Genre is stored as a single string listing multiple genres for each movie. To simplify our model-building process, we will remove the variable from the data set. We will also remove the secondary variables that supply the additional stars of each movie.
#now we can deselect the string variable 'genre'
IMDB <- IMDB %>% select(-genre,-star2,-star3,-star4)
Our data set is now made up of 9 variables and 831 observations.
After our data has been cleaned and mutated into usable data types, we perform an initial split; 80% of the data is put into the training set, and 20% is set aside for the testing set. Stratified sampling is used on the outcome variable gross.
set.seed(1000)
imdb_split <- initial_split(IMDB, prop=0.8, strata=gross)
imdb_train <- training(imdb_split)
imdb_test <- testing(imdb_split)
dim(imdb_train)
dim(imdb_test)
There are 663 observations in our training data set and 168 in our testing data set.
Using the 663 observations (movies) in our training set, I will explore the distribution of certain variables as well as any hypotheses I might have about their correlations or significance to the outcome variable. First, let’s reload our data and take a look at a correlation matrix of all the continuous variables.
load("/Users/kevin/PSTAT126/PSTAT131/PSTAT131-FinalProject/IMDB_topmovies_csv/model_setup.rda")
imdb_cor <- cor(imdb_train[sapply(imdb_train, is.numeric)])
corrplot(imdb_cor, method='number')
The correlation matrix indicates that no_of_votes has a significant correlation with gross, as well as with imdb_rating. More surprisingly, there is also a slight correlation between imdb_rating and runtime.
Before we explore how released_year affects gross and other variables, we will check how the movies in the data set are distributed across the years. To get a feel for the distribution, we can count the movies by year and output some summary statistics.
#first we will sort and count the year distribution and output summary statistics
min_year <- min(imdb_train$released_year)
max_year <- max(imdb_train$released_year)
sort_by_year <- group_by(imdb_train, released_year)
year_counts <- count(sort_by_year)
year_counts_df <- data.frame(year_counts)
max_yc <- max(year_counts_df$n)
min_yc <- min(year_counts_df$n)
mpop_year <- subset(year_counts_df, n == max_yc)
print(paste("The earliest value of released_year in our training data:", min_year))
## [1] "The earliest value of released_year in our training data: 1921"
print(paste("The most recent value of released_year in our training data:", max_year))
## [1] "The most recent value of released_year in our training data: 2019"
print(paste("The year in which the most movies in the data set were released:", mpop_year$released_year))
## [1] "The year in which the most movies in the data set were released: 2004"
print(paste("The maximum number of observed movies released in one year:", max_yc))
## [1] "The maximum number of observed movies released in one year: 27"
Now we look at a visual representation of the variable’s distribution in the training set.
color_set <- c("chocolate2", "tan","burlywood","cornsilk4", "bisque3","brown")
barplot(height = year_counts_df$n, xlab= "Year of Release", ylab= "Number of Movies", names.arg= year_counts_df$released_year,cex.names = 0.6, col= color_set)
From the bar plot above, it appears that the number of movies in the training data released each year starts very low and begins to increase steeply in the 1990s, reaching a maximum in 2004.
Since our model will be predicting gross, it is useful to look at the distribution of the variable over the years; in other words, when were the most high-grossing movies released? First, let’s identify the 10 movies with the largest values of gross.
imdb_sorted <- arrange(imdb_train,desc(gross))
top_10_gross <- head(imdb_sorted,10)
top_10_gross <- top_10_gross %>%
mutate(
gross = substr(gross,1,3),
gross = as.numeric(gross)
)
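A caveat worth flagging: taking the first three characters only converts gross to millions when the value has exactly nine digits. All ten movies here gross in the hundreds of millions, so the shorthand works, but it silently misreads shorter values:

```r
# substr() on a numeric coerces it to character first, so the first
# three digits of a nine-digit gross are its value in millions...
as.numeric(substr(936662225, 1, 3))   # 936 (correct: ~936 million)
# ...but an eight-digit gross yields the same three digits, off by 10x.
as.numeric(substr(93662225, 1, 3))    # 936 (actual value is ~93.7 million)
```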
reds = c("tomato","firebrick2","indianred1","darkred",
"coral", "tomato","firebrick2","indianred1",
"darkred","coral")
top_10_gross %>%
ggplot(aes(series_title, gross)) +
geom_col(fill=reds) +
coord_flip() +
labs(
title="Top 10 Highest-Grossing Movies in Training Data",
y="Gross Revenue (in Millions)",
x= "Movie Title"
)
Now we can look at the distribution of gross in the training data over the years.
summed_grossy <- imdb_train %>%
group_by(released_year) %>%
summarize(Gross=sum(gross))
plot(summed_grossy, xlab= "Released Year", main="Total Gross of Observed Movies Released Each Year", col="darkblue", type="l")
Since we have already observed that the number of movies in our data set released per year increased steeply from the late 1990s into the 2000s, it is unsurprising that the total gross revenue by year begins to rise around the same time. It may be more helpful to divide each year’s gross by the number of movies released that year, to get a sense of how the average revenue of a single movie has changed over time.
grossy_avg <- summed_grossy %>%
mutate(
Gross = (Gross / year_counts_df$n)
)
grossy_avg <- grossy_avg %>%
mutate(
Gross= Gross / 1000000
)
plot(grossy_avg, type="l", col="darkgreen", main="Average Gross Per Movie by Year", ylab="Gross in Millions of Dollars", xlab="Released Year")
From the graph above, we can infer that the average gross per movie fluctuates throughout, but increases significantly starting in the 1970s and remains on a clear upward trend from the early 2000s to 2019.
Let’s take a quick look at the top directors in our data set, that is, the directors who appear most frequently in the training data.
sorted_director <- group_by(imdb_train, director)
director_counts <- count(sorted_director)
director_cdf <- data.frame(director_counts)
director_cdf <- arrange(director_cdf, desc(n))
top_directors <- head(director_cdf,20)
color_set2 <- c("chocolate2", "tan","brown","cornsilk4", "bisque3",
"brown","chocolate2", "tan","burlywood", "pink",
"chocolate2", "tan","brown","cornsilk4", "bisque3",
"brown","chocolate2", "tan","burlywood", "pink")
top_directors %>%
ggplot(aes(director, n)) +
geom_col(fill=color_set2) +
coord_flip() +
labs(
title= "Top 20 Most Frequently-Appearing Directors in Training Data ",
x= "Director",
y= "Number of Observed Movies"
)
The most commonly occurring director is Martin Scorsese, with 9 of his movies appearing in the training data set. Alfred Hitchcock, Christopher Nolan, and Steven Spielberg each directed 8 movies in the training data set, while David Fincher and Woody Allen both directed 7.
Using only the gross data for the movies of each of these 6 directors, we can analyze the relationship between director and gross, to see how these popular directors’ movies fare compared to the gross of an average movie in the training data.
total_gross <- imdb_train %>%
summarize(total_gross=sum(gross))
average_gross <- total_gross$total_gross / nrow(imdb_train)
top_gross_director <- sorted_director %>%
filter(
director == "Martin Scorsese" |
director == "Christopher Nolan" |
director == "David Fincher" |
director == "Woody Allen" |
director == "Alfred Hitchcock" |
director == "Steven Spielberg"
)
#movie counts for each director, in the alphabetical order produced by group_by()
top_direct_counts <- c(8,8,7,9,8,7)
gross_dir_sum <- top_gross_director %>%
summarize(gross = sum(gross))
new_row <- list("Average", average_gross)
gross_dir_sum$gross <- gross_dir_sum$gross / top_direct_counts
gross_dir_sum <- gross_dir_sum %>%
mutate(
director = as.character(director)
)
gross_dir_sum[7,] <- new_row
gross_dir_sum <- gross_dir_sum %>%
mutate(
gross = gross / 1000000
)
sev_colors <- c("chocolate2", "tan","burlywood","cornsilk4", "bisque3","brown","darkred")
gross_dir_sum %>%
ggplot(aes(director,gross)) +
geom_col(fill=sev_colors) +
coord_flip() +
labs(
title= "Average Gross per Movie by Director",
y= "Average Gross Revenue in Millions",
x= "Director"
)
Movies directed by Christopher Nolan and Steven Spielberg have a significantly higher average gross than the average movie in the data set, almost four times larger. This suggests that movies directed by certain directors are linked with much higher values of gross.
The observations in the data set “IMDB Top 1000 Movies” were chosen due to their high values of imdb_rating, meaning this variable will probably not have a very wide range. I hypothesize that movies with higher IMDB ratings will have higher values of gross. Let’s view the distribution of the variable across the training data.
rating_grouped <- imdb_train %>%
group_by(imdb_rating)
rating_counts <- count(rating_grouped)
rating_counts %>%
ggplot(aes(imdb_rating, n)) +
geom_col(fill = color_set2[1:16]) +
labs(
title= "IMDB Rating Distribution Across Training Data",
x= "IMDB Rating Score",
y= "Frequency"
)
In the training data, imdb_rating ranges from 7.6 to 9.3, with the bulk of the observations at the lower end of that range. The most common value in the data set is 7.7, while the maximum value of 9.3 is the least common.
Next, we can take a closer look at how gross is distributed within each level of imdb_rating.
rating_grouped2 <- rating_grouped %>%
mutate(
gross = gross / 1000000
)
rating_grouped2 %>%
ggplot(aes(x=imdb_rating,y= gross)) +
geom_point(alpha=1) +
stat_summary(aes(y=gross,group=1), fun=mean, colour="orange", geom="line") +
labs(
title= "Gross Revenue of Observed Movies vs. IMDB Rating",
x= "IMDB Rating Score",
y= "Gross in Millions of Dollars"
)
The scatter plot above shows that gross does increase slightly with higher scores of imdb_rating, but many of the highest-grossing observed movies actually occur around the 7.8-8.4 range.
It might also be interesting to observe how values of imdb_rating have changed over time. To explore the relationship between these variables, we can construct another plot of imdb_rating versus released_year.
sort_by_year %>%
ggplot(aes(x=released_year, y=imdb_rating)) +
geom_point(alpha=1) +
coord_flip() +
labs(
title= "IMDB Rating for Observed Movies by Year of Release",
x= "Released Year",
y="IMDB Rating Score"
)
There is no glaring correlation between imdb_rating and released_year, especially given that most of the observations are skewed toward recent years and rating scores in the 7-8 range. However, since most of the data is clustered in the top left corner of the plot, we can see that the majority of the lower-rated films in the training data were released recently, within the last 30 years.
From the correlation matrix we looked at in the beginning of our EDA, we know that the feature no_of_votes has a significant correlation with gross. This seems pretty intuitive: the more people vote on a movie, the more people have probably watched it, which would suggest a larger gross revenue. Let’s first explore the distribution of no_of_votes in the training data with a visual representation.
thous_votes <- imdb_train %>%
mutate(
no_of_votes = no_of_votes / 1000 ,
no_of_votes = as.integer(no_of_votes)
)
thous_votes_counts <- thous_votes %>%
group_by(no_of_votes) %>%
count(no_of_votes)
thous_votes_counts %>%
ggplot(aes(no_of_votes, n)) +
geom_line(col="coral") +
geom_vline(xintercept=150, linetype=2)+
geom_vline(xintercept=25, linetype=2)+
labs(
title= "Distribution of the Number of Votes Among Observed Movies",
x= "Thousands of Votes",
y= "Number of Movies"
)
The majority of the movies seem to have a no_of_votes value between 25,000 and 150,000, represented by the area between the two dashed vertical lines. Since our correlation matrix showed a correlation between no_of_votes and our outcome gross, let’s graph the relationship between the two variables.
imdb_train %>%
ggplot(aes(no_of_votes, gross)) +
geom_point(col="darkred") +
geom_smooth(aes(no_of_votes ,gross),method="lm",linetype=2,
fill="gray",col="black")+
labs(
title="Gross Revenue of Films by Number of Votes",
x="Number of Votes Received",
y= "Gross Revenue"
)
The line of best fit drawn through the graph gives us an idea of the overall relationship between no_of_votes and gross: there is a strong positive correlation between the two. As the number of votes a movie receives increases, so does its gross revenue. The more votes a movie receives, the more popular we can assume it is, and that popularity translates into greater revenue.
Since imdb_rating also showed a correlation with no_of_votes, let’s see how those two variables interact visually. A reminder: the voters being counted in no_of_votes are collectively responsible for creating the imdb_rating value.
imdb_train %>%
ggplot(aes(no_of_votes, imdb_rating)) +
geom_smooth(aes(no_of_votes ,imdb_rating),method="lm",linetype=2,
fill="gray",col="black")+
geom_line(col="brown") +
labs(
title="IMDB Rating by Number of Votes",
x= "Number of Votes (Ratings) Received",
y= "Rating Score (out of 10)"
)
The line of best fit again shows a substantial positive correlation: the movies with the highest rating on IMDb also have the highest number of votes. Good movies, apparently, incentivize more people to vote on them.
In the original data set, there are four separate predictor variables for the stars of each movie: star1, star2, star3, and star4. Since each of these variables is a factor with hundreds of levels, we are including only star1, and in our recipe we will pool the rarely-appearing actor names into a single level.
imdb_star <- group_by(imdb_train, star1)
star_counts <- count(imdb_star)
star_counts <- arrange(star_counts, desc(star_counts$n))
star_counts <- head(star_counts,15)
star_counts_df <- data.frame(star_counts)
fifteen_col <- c("pink", "chocolate2", "tan","cornsilk4", "bisque3","brown","darkred", "pink", "chocolate2", "tan","burlywood","cornsilk4", "bisque3","brown","darkred")
star_counts_df %>%
ggplot(aes(star1,n)) +
geom_col(fill=fifteen_col) +
coord_flip() +
labs(
title= "Number of Observed Movies by Main Starring Actor",
y= "Number of Movies",
x= "Starring Actor"
)
The most frequently-occurring actors in the star1 category are Al Pacino, Tom Hanks, and Robert De Niro, each starring in 10 of the observed movies in our training data set. The fourth most-common actor who occurs in the star1 column is Clint Eastwood with a count of 8 films, followed by Leonardo DiCaprio, Denzel Washington, and Christian Bale with 7. Let’s take a look at how the average values of gross for movies starring these actors fare against the average movie in the dataset.
top_gross_stars <- imdb_star %>%
filter(
star1 == "Al Pacino" |
star1 == "Tom Hanks" |
star1 == "Robert De Niro" |
star1 == "Denzel Washington" |
star1 == "Christian Bale" |
star1 == "Leonardo DiCaprio" |
star1 == "Clint Eastwood"
)
#movie counts for each actor, in the alphabetical order produced by group_by()
top_star_counts <- c(10,7,8,7,7,10,10)
gross_star_sum <- top_gross_stars %>%
summarize(gross = sum(gross))
gross_star_sum$gross <- gross_star_sum$gross / top_star_counts
gross_star_sum <- gross_star_sum %>%
mutate(
star1 = as.character(star1),
)
gross_star_sum[8,] <- new_row
gross_star_sum <- gross_star_sum %>%
mutate(
gross = gross / 1000000
)
eight_colors <- c("chocolate2", "tan","burlywood","cornsilk4", "bisque3","brown","darkred", "pink")
gross_star_sum %>%
ggplot(aes(star1,gross)) +
geom_col(fill=eight_colors) +
coord_flip() +
labs(
title= "Average Gross per Movie by Starring Actor",
y= "Average Gross Revenue in Millions",
x= "Starring Actor"
)
According to this graph, movies whose main star is Leonardo DiCaprio, Christian Bale, or Tom Hanks have a much higher average gross revenue than the rest of the movies in the dataset.
After Data Cleaning and Exploratory Data Analysis, our training data should be ready to be fit to a regression model. The first step we will take is folding the data into 10 “folds”, with 5 repeats, stratified by gross.
imdb_folds <- vfold_cv(imdb_train, v=10, repeats=5, strata=gross)
When we tune and fit our models, we will perform k-fold cross-validation across these folds. This is a resampling method that estimates a model’s accuracy repeatedly in order to guard against overfitting — training a model so specifically to the training data that it performs poorly on new data. For each of the k folds, that fold is held out as the assessment set and the model is trained on the other k-1 folds; every fold serves as the assessment set exactly once, and the metrics are averaged across the folds.
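The procedure can be sketched in base R (a minimal illustration on the built-in mtcars data with k = 5 and a single repeat; the report itself uses rsample::vfold_cv(), which also handles stratification and repeats):

```r
# Minimal k-fold cross-validation sketch in base R (k = 5, one repeat).
set.seed(1000)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))  # random fold labels

rmse_per_fold <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ]   # fit on the other k-1 folds
  test  <- mtcars[folds == i, ]   # assess on the held-out fold
  fit   <- lm(mpg ~ wt, data = train)
  pred  <- predict(fit, newdata = test)
  sqrt(mean((test$mpg - pred)^2)) # RMSE on the held-out fold
})
mean(rmse_per_fold)               # the cross-validated estimate
```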
Now that we have our folded data ready for cross-validation, our next step is to build a recipe for our models. We will use 6 features to predict gross: released_year, runtime, imdb_rating, director, star1, and no_of_votes. The categorical predictors will need to be dummy-coded. In addition, since director and star1 each have hundreds of levels, we will use step_other() to pool the rare levels, keeping individual levels only for actors who appear at least 5 times and directors who appear at least 4 times.
#first step is to build the basic recipe using a formula and the training data
imdb_rec <- recipe(gross ~ released_year + runtime +
imdb_rating + no_of_votes + director +
star1,
data= imdb_train) %>%
step_novel(all_nominal_predictors()) %>%
step_zv(all_predictors()) %>%
step_unknown(all_nominal_predictors()) %>%
step_other(star1, threshold= 5) %>%
step_other(director, threshold=4) %>%
step_dummy(director, star1) %>%
#finally we add steps to center and scale (normalize) our predictors
step_normalize(all_predictors())
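A base-R sketch of the pooling that step_other() performs (toy director names; the real step keeps a level only if it meets the frequency threshold):

```r
# Pool levels that appear fewer than `threshold` times into "other",
# mirroring what step_other(director, threshold = 4) does in the recipe.
x <- c("Nolan", "Nolan", "Nolan", "Scorsese", "Lee", "Chazelle")
threshold <- 3
counts <- table(x)
pooled <- ifelse(counts[x] >= threshold, x, "other")
factor(unname(pooled))
# [1] Nolan Nolan Nolan other other other
# Levels: Nolan other
```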
To make sure we are always working with the same training and testing sets, I saved all our model-building information to a file.
#for security, saving the data, recipe, and folds periodically
save(imdb_train, imdb_test, imdb_folds, imdb_rec, file= "/Users/kevin/PSTAT126/PSTAT131/PSTAT131-FinalProject/IMDB_topmovies_csv/model_setup.rda")
Our next step is to build and run five models: Random Forest, Decision Tree, Boosted Trees, Bagging Trees, and Polynomial Regression.
load("/Users/kevin/PSTAT126/PSTAT131/PSTAT131-FinalProject/IMDB_topmovies_csv/model_setup.rda")
The first model I will be building is a Random Forest model, tuning parameters mtry and min_n for optimization. To begin, I will create a random forest model spec.
#our outcome is continuous and not categorical, so we specify a regression model
random_forest <- rand_forest(min_n= tune(),
mtry=tune(),
mode="regression") %>%
set_engine("ranger")
Next we must create a workflow object and add both our recipe and our model spec.
rf_wf <- workflow() %>%
add_model(random_forest) %>%
add_recipe(imdb_rec)
Since we are tuning the parameter mtry, which is the number of predictors randomly sampled as split candidates at each node, we must set a plausible range for it and create a grid to be tuned. We want this value to stay below the total number of predictors. The parameter min_n is the minimum number of data points a node must contain before it can be split.
rf_grid <- grid_regular(mtry(range=c(1,15)),
min_n(),levels=10)
Using our grid and workflow, we can now tune the model across the folded training data. Since this model takes a while to tune, we will save the results for future use.
tuned_rf <- rf_wf %>%
tune_grid(resamples= imdb_folds,
grid= rf_grid,
metrics=metric_set(rsq,rmse)
)
save(rf_wf, tuned_rf, file= "/Users/kevin/PSTAT126/PSTAT131/PSTAT131-FinalProject/IMDB_topmovies_csv/tuned_randomforest.rda")
For our second model, we will fit a decision tree. As before, we start by creating a decision tree regression model specification, using the engine rpart and signaling that we will tune the parameter cost_complexity. This parameter sets the minimum improvement in fit that a split must provide to be kept; splits that fail to meet it are pruned.
tree_spec <- decision_tree(cost_complexity = tune()) %>%
set_engine("rpart") %>%
set_mode("regression")
Now we create a workflow object that includes the model spec and our recipe, and a grid of potential values for cost_complexity to take.
tree_wf <- workflow() %>%
add_model(tree_spec) %>%
add_recipe(imdb_rec)
cc_grid <- grid_regular(
cost_complexity(range=c(-4,-1)),
levels=5
)
Now we can combine our workflow and tuning grid to tune the model across our training data folds. Since running these models also takes a good amount of time, the workflow and tuned data are saved to another file.
tuned_dt <- tune_grid(
tree_wf,
resamples= imdb_folds,
grid=cc_grid,
metrics=metric_set(rmse,rsq)
)
save(tuned_dt, tree_wf, file=
"/Users/kevin/PSTAT126/PSTAT131/PSTAT131-FinalProject/IMDB_topmovies_csv/tuned_decisiontrees.rda")
Next we will be building a Boosted Tree model, using the engine xgboost and specifying that the parameters trees and tree_depth will be tuned. The trees parameter is the number of trees in the ensemble, and tree_depth is the maximum depth of each tree.
boost_spec <- boost_tree(trees=tune(), tree_depth=tune()) %>%
set_engine("xgboost") %>%
set_mode("regression")
As we have done before, we will add the model spec to a workflow and then tune it across a grid of parameter values and our folded training data.
boost_wf <- workflow()%>%
add_model(boost_spec) %>%
add_recipe(imdb_rec)
boost_grid <- grid_regular(trees(range=c(1,1000)),
tree_depth(range=c(1,10)),levels=5)
boost_tuned <- tune_grid(
boost_wf,
resamples=imdb_folds,
grid=boost_grid,
metrics=metric_set(rmse,rsq)
)
save(boost_tuned, boost_wf, file=
"/Users/kevin/PSTAT126/PSTAT131/PSTAT131-FinalProject/IMDB_topmovies_csv/tuned_boostedtrees.rda")
For our fourth model, we will try a bagging fit. This uses the same model type as random forest, but with mtry fixed at the total number of predictors, so every tree considers all of them (here via the randomForest engine rather than ranger). As with our random forest model, we will tune min_n.
bagging_spec <- rand_forest(mtry=.cols(),min_n=tune()) %>%
set_engine("randomForest", importance=TRUE) %>%
set_mode("regression")
Now we can build a workflow, a grid, and tune it across the folded training data.
bagging_wf <- workflow() %>%
add_model(bagging_spec) %>%
add_recipe(imdb_rec)
bag_grid <- grid_regular(min_n(range=c(2,40)), levels=10)
tuned_bag <- tune_grid(
bagging_wf,
resamples=imdb_folds,
grid=bag_grid,
metrics=metric_set(rsq,rmse)
)
save(tuned_bag, bagging_wf, file=
"/Users/kevin/PSTAT126/PSTAT131/PSTAT131-FinalProject/IMDB_topmovies_csv/tuned_baggingmodel.rda")
Since our EDA showed a significant correlation between gross and no_of_votes, we will attempt to introduce non-linearity by performing a polynomial expansion (an extension of ordinary linear regression) on the variable no_of_votes, with an optimized value of degree.
First we will create our new polynomial recipe.
poly_rec <- imdb_rec %>%
step_poly(no_of_votes, degree=tune())
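Under the hood, step_poly() uses stats::poly() to generate orthogonal polynomial basis columns; a toy example on some made-up vote counts:

```r
# stats::poly() returns one orthogonal basis column per degree;
# step_poly(no_of_votes, degree = tune()) tunes how many to create.
votes <- c(25000, 80000, 150000, 400000, 900000)
basis <- poly(votes, degree = 3)
dim(basis)   # 5 rows (observations) x 3 columns (degrees)
```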
Now we can build our linear regression model spec, workflow, and grid of potential values for tuning the degree parameter.
poly_reg_spec <- linear_reg() %>%
set_mode("regression") %>%
set_engine("lm")
poly_wf <- workflow() %>%
add_model(poly_reg_spec) %>%
add_recipe(poly_rec)
degree_grid <- grid_regular(degree(range=c(1,10)),levels=10)
Finally, we tune the workflow over this grid across our training folds, then save the results.
tuned_pr <- tune_grid(
poly_wf,
resamples=imdb_folds,
grid=degree_grid,
metrics=metric_set(rmse,rsq)
)
save(tuned_pr, poly_wf, file=
"/Users/kevin/PSTAT126/PSTAT131/PSTAT131-FinalProject/IMDB_topmovies_csv/tuned_polyreg.rda")
Our models have all been built and tuned, so it’s time to see how they performed and do some comparative analysis.
load("/Users/kevin/PSTAT126/PSTAT131/PSTAT131-FinalProject/IMDB_topmovies_csv/tuned_randomforest.rda")
load("/Users/kevin/PSTAT126/PSTAT131/PSTAT131-FinalProject/IMDB_topmovies_csv/tuned_decisiontrees.rda")
load("/Users/kevin/PSTAT126/PSTAT131/PSTAT131-FinalProject/IMDB_topmovies_csv/tuned_boostedtrees.rda")
load("/Users/kevin/PSTAT126/PSTAT131/PSTAT131-FinalProject/IMDB_topmovies_csv/tuned_baggingmodel.rda")
load("/Users/kevin/PSTAT126/PSTAT131/PSTAT131-FinalProject/IMDB_topmovies_csv/tuned_polyreg.rda")
We can visualize the rsq and rmse values of the tuned random forest models using autoplot(). \(R^2\) is the proportion of the variance of the response that can be explained by the predictor variables, and RMSE (root mean squared error) measures the typical size of a prediction error, in the units of the outcome.
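Both metrics can be computed by hand (a toy base-R sketch on made-up truth/estimate vectors; yardstick's rsq uses the squared correlation between truth and estimate):

```r
# Toy truth/estimate vectors, standing in for observed and predicted gross.
truth    <- c(100, 250, 80, 400)
estimate <- c(120, 230, 90, 380)

sqrt(mean((truth - estimate)^2))   # rmse: typical prediction error
cor(truth, estimate)^2             # rsq: proportion of variance explained
```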
autoplot(tuned_rf)
We are looking for low values of rmse and high values of rsq, so from the plot above it seems that low values of min_n and higher values of mtry are optimal, specifically min_n=2 and mtry=13. Let’s take a look at the best-performing random forest models below.
head(show_best(tuned_rf, "rsq"),2)
## # A tibble: 2 × 8
## mtry min_n .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 13 2 rsq standard 0.549 50 0.0147 Preprocessor1_Model009
## 2 11 2 rsq standard 0.548 50 0.0144 Preprocessor1_Model008
head(show_best(tuned_rf, "rmse"),2)
## # A tibble: 2 × 8
## mtry min_n .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 13 2 rmse standard 73352250. 50 2132829. Preprocessor1_Model009
## 2 15 2 rmse standard 73567086. 50 2125141. Preprocessor1_Model010
It appears that the two highest-performing models differ depending on which metric you use to compare; however, the model with mtry=13 performs best on both criteria.
Only one parameter was tuned for the decision tree model, so the graph will look a bit different.
autoplot(tuned_dt)
From the graph of the tuned models above, it appears that the highest values of \(R^2\) and the lowest values of RMSE are achieved with lower values of cost_complexity. Now we will take a look at the best-performing model of our tuned decision trees.
head(show_best(tuned_dt, "rsq"),1)
## # A tibble: 1 × 7
## cost_complexity .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.0001 rsq standard 0.393 50 0.0142 Preprocessor1_Model1
head(show_best(tuned_dt, "rmse"),1)
## # A tibble: 1 × 7
## cost_complexity .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.0001 rmse standard 87084412. 50 2256021. Preprocessor1_Mod…
The best decision tree model has an \(R^2\) of only 0.393 and a higher (worse) RMSE than our best tuned random forest models.
Next, a visual representation of our tuned boosted tree models:
autoplot(boost_tuned)
It looks like the boosted tree models achieve optimal rsq and rmse values when tree_depth=3 and trees=250. We can view the best of our boosted tree models below.
head(show_best(boost_tuned, "rsq"),1)
## # A tibble: 1 × 8
## trees tree_depth .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 250 3 rsq standard 0.556 50 0.0184 Preprocessor1_Model07
head(show_best(boost_tuned, "rmse"),1)
## # A tibble: 1 × 8
## trees tree_depth .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 250 3 rmse standard 72097646. 50 2001342. Preprocessor1_Mo…
Our boosted tree rsq is about 0.556, and the rmse is about 72,097,646. Compared with our best random forest model, the values are very similar, though slightly better for the boosted tree.
In our bagging model, we used a rand_forest model spec but set mtry equal to the total number of predictors, and tuned the parameter min_n.
autoplot(tuned_bag)
The graph suggests that the model is optimized at min_n=2, and performs progressively worse as the minimum node size increases. We can view the metrics of the best-performing model below.
head(show_best(tuned_bag,"rsq"),1)
## # A tibble: 1 × 7
## min_n .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 2 rsq standard 0.550 50 0.0170 Preprocessor1_Model01
head(show_best(tuned_bag,"rmse"),1)
## # A tibble: 1 × 7
## min_n .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 2 rmse standard 77671538. 50 2780372. Preprocessor1_Model01
The best performing bagging model has an rsq of 0.55 and an rmse of 77,671,538.
Finally, we assess the performance of our last model, a polynomial regression with expansion on no_of_votes, the variable most highly correlated with the outcome variable.
autoplot(tuned_pr)
head(show_best(tuned_pr, "rmse"),2)
## # A tibble: 2 × 7
## degree .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 1 rmse standard 84798737. 50 3197882. Preprocessor01_Model1
## 2 2 rmse standard 85349126. 50 3176646. Preprocessor02_Model1
head(show_best(tuned_pr, "rsq"),2)
## # A tibble: 2 × 7
## degree .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 1 rsq standard 0.471 50 0.0103 Preprocessor01_Model1
## 2 6 rsq standard 0.471 50 0.0114 Preprocessor06_Model1
The best-performing models of our tuned polynomial regression only achieved an rsq of 0.471 and an rmse of 84,798,737.
To assess which model performed the best, we will look at their stats side by side.
#first we select our best rf model and isolate the rsq and rmse
rand_forest_best <- show_best(tuned_rf, "rsq")
rf_rsq <- head(rand_forest_best$mean,1)
rand_forest_best2 <- head(show_best(tuned_rf,"rmse"),1)
rf_rmse <- rand_forest_best2$mean
#next we do the same with our decision tree model
dt_best <- show_best(tuned_dt, "rsq")
dt_rsq <- head(dt_best$mean,1)
dt_best2 <- show_best(tuned_dt,"rmse")
dt_rmse <- head(dt_best2$mean,1)
#boosted trees
bt_best <- show_best(boost_tuned, "rsq")
bt_rsq <- head(bt_best$mean,1)
bt_best2 <- show_best(boost_tuned,"rmse")
bt_rmse <- head(bt_best2$mean,1)
#bagging model
bag_best <- show_best(tuned_bag, "rsq")
bag_rsq <- head(bag_best$mean,1)
bag_best2 <- show_best(tuned_bag,"rmse")
bag_rmse <- head(bag_best2$mean,1)
#polynomial regression model
pr_best <- show_best(tuned_pr, "rsq")
pr_rsq <- head(pr_best$mean,1)
pr_best2 <- show_best(tuned_pr,"rmse")
pr_rmse <- head(pr_best2$mean,1)
#now we create vectors for each category, and bind the columns together
model_names <- c("Random Forest", "Decision Tree",
"Boosted Tree", "Bagging",
"Polynomial Regression")
model_rsq <- c(rf_rsq, dt_rsq, bt_rsq, bag_rsq,
pr_rsq)
model_rmse <- c(rf_rmse, dt_rmse, bt_rmse,
bag_rmse, pr_rmse)
bind_cols(model=model_names,
rsq=model_rsq,
rmse=model_rmse)
## # A tibble: 5 × 3
## model rsq rmse
## <chr> <dbl> <dbl>
## 1 Random Forest 0.549 73352250.
## 2 Decision Tree 0.393 87084412.
## 3 Boosted Tree 0.556 72097646.
## 4 Bagging 0.550 77671538.
## 5 Polynomial Regression 0.471 84798737.
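If desired, the same table can be ranked so the strongest model appears first (a sketch reusing the vectors built above; arrange() comes from dplyr, loaded with tidymodels):

```r
#rank the models by RMSE, best (lowest) first
bind_cols(model = model_names,
          rsq = model_rsq,
          rmse = model_rmse) %>%
  arrange(rmse)
```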
The Decision Tree model performed the worst, followed closely by Polynomial Regression. The Random Forest, Bagging, and Boosted Tree models all performed pretty similarly to each other.
Since Boosted Trees achieved both the highest \(R^2\) and the lowest RMSE, it is our best-performing model and will be fit to the training and testing data.
First, we need to update our Boosted Tree workflow with the parameter values of the optimal tuned model. For this, we use functions select_best() and finalize_workflow(). Then we fit the workflow to the training data.
best_boost_model <- select_best(boost_tuned, metric="rsq")
final_b_workflow<- finalize_workflow(boost_wf, best_boost_model)
final_boost_fit <- fit(final_b_workflow, data=imdb_train)
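Before predicting, it can be worth sanity-checking that the tuned values actually made it into the finalized workflow (a quick sketch using the objects above; extract_spec_parsnip() comes from the workflows package):

```r
#the winning hyperparameter combination
best_boost_model
#the finalized model spec; should show trees=250 and tree_depth=3
extract_spec_parsnip(final_b_workflow)
```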
Now for the moment of truth: fitting our trained model to our testing data set, and calculating our final RMSE.
set.seed(4848)
test_predictions <- predict(final_boost_fit, new_data=imdb_test) %>%
bind_cols(gross = imdb_test$gross , title= imdb_test$series_title,
director = imdb_test$director)
#output the first 10 predicted versus actual gross observations
head(test_predictions,10)
## # A tibble: 10 × 4
## .pred gross title director
## <dbl> <int> <chr> <fct>
## 1 29731624 28341469 The Shawshank Redemption Frank D…
## 2 317024128 315544750 The Lord of the Rings: The Fellowship of the R… Peter J…
## 3 328981856 330252182 Forrest Gump Robert …
## 4 329934112 342551365 The Lord of the Rings: The Two Towers Peter J…
## 5 71709568 53367844 Gisaengchung Bong Jo…
## 6 130569136 136801374 The Green Mile Frank D…
## 7 97327384 100125643 Se7en David F…
## 8 5253946. 19181 City Lights Charles…
## 9 14661955 11286112 The Lives of Others Florian…
## 10 2080704 707481 Oldeuboi Chan-wo…
From the table above, the model seems to be performing very well on the testing set. Now to calculate the RMSE:
final_rmse <- test_predictions %>%
rmse(truth = gross, estimate=.pred)
final_rmse
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 21303629.
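For completeness, the same predictions can also be scored with \(R^2\) on the testing set, computed the same way as the RMSE above (a sketch, not run here):

```r
#test-set R-squared for the final boosted tree fit
test_predictions %>%
  rsq(truth = gross, estimate = .pred)
```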
The model’s root-mean-square error on the testing data is much lower than the cross-validated RMSE we saw during tuning! Using a scatterplot, we can see how closely the predicted gross values for our testing set resemble the actual observed values.
test_predictions %>%
ggplot(aes(.pred, gross)) +
geom_point(alpha=0.6, col="coral") +
geom_abline(slope=1, intercept=0) +
labs(
title= "Predicted Versus Observed Gross Revenue",
x= "Predicted Gross Revenue",
y= "Actual Gross Revenue"
)
The black line represents the ideal outcome: the points at which our predictions equal the actual gross. Since the points track this line closely, the model performs well, especially for lower-grossing films.
Next we can do some exploratory analysis of the model performance by variable. First let’s look at how the RMSE varied within the levels of director. Specifically, we will look at some of the top directors we examined in our EDA.
#first we calculate the RMSE for each director
final_rmse_directors <- test_predictions %>%
select(director, gross, .pred) %>%
group_by(director) %>%
rmse(truth = gross, estimate=.pred)
top_final_rmse_directors <- final_rmse_directors %>%
filter(director %in% c(
"Martin Scorsese", "Christopher Nolan", "David Fincher",
"Woody Allen", "Alfred Hitchcock", "Steven Spielberg",
"Wes Anderson", "Hayao Miyazaki", "Clint Eastwood",
"Quentin Tarantino", "Billy Wilder", "Peter Jackson"
))
top_final_rmse_directors <- top_final_rmse_directors %>%
mutate(
.estimate = .estimate / 1000000
)
colorsetx <- c("chocolate2", "tan","burlywood","cornsilk4",
"bisque3","brown","darkred", "pink", "coral", "chocolate")
top_final_rmse_directors %>%
ggplot(aes(director,.estimate)) +
geom_col(fill=colorsetx) +
coord_flip() +
labs(
title= "Mean Error In Predicted Gross of Top Directors' Films",
y= "Root Mean Square Error in Millions of Dollars",
x= "Director"
)
The RMSE for movies directed by Spielberg is much higher than for the other directors, but in general the films by these top directors all have lower error than the model’s overall test-set RMSE. This suggests that the model predicts the gross of a movie by a top director more accurately.
Next, let’s take a look at the distribution of error by no_of_votes.
test_predictions_rmse <- test_predictions %>%
bind_cols(no_of_votes = imdb_test$no_of_votes) %>%
group_by(no_of_votes) %>%
rmse(truth = gross, estimate=.pred)
test_predictions_rmse %>%
ggplot(aes(no_of_votes, .estimate)) +
geom_point(col="darkred") +
geom_hline(yintercept=(final_rmse$.estimate),linetype=2 )+
labs(
title = "RMSE of Model Predictions as Numbers of Votes Increase",
x= "Number of Votes",
y= "Root Mean Square Error of Model"
)
The graph shows that the RMSE of the model’s predictions decreases steeply once no_of_votes surpasses about 1,000,000. The dotted line represents the overall RMSE of the model on the testing data, and all the movies with higher numbers of votes have noticeably lower-than-average error. In other words, higher values of no_of_votes correspond to much better prediction accuracy.
I wanted to build a model that could predict the gross revenue of a movie with some accuracy given the film’s release year, runtime, director, main actor, IMDb rating, and the number of votes it received on the IMDb website.
To incorporate categorical variables with such a large range of possible values, I had to set a threshold for how often a category must occur, and to group together all categories that appeared too infrequently. This made the model less granular, but with hundreds of levels in each categorical feature, the data needed to be simplified. In the future, if I were to build a more precise model, I might lower the threshold to allow more individual levels to be considered.
In fitting my models, the Decision Tree model ended up with the lowest rsq by a large margin. This is not too surprising, since decision trees are especially prone to overfitting. The polynomial regression model performed only slightly better; it might have done better with the expansion applied to a different continuous variable, but the optimal tuned model had degree=1, meaning it performed best without any polynomial terms at all.
Random forest performed noticeably better than the bagging model on RMSE, which makes sense: bagging is effectively a random forest with mtry fixed at the total number of predictors, and a few of the variables in our recipe appear to contribute little to predicting gross.
From analysis of our final model fit to the testing data, it appears that no_of_votes is the feature most useful in predicting our response, though the top actors and directors also have a significant influence on/relationship with gross.
The IMDB TOP 1000 MOVIES dataset is clearly skewed; the movies all fall in the imdb_rating range of \(7.6-9.3\), which means that the correlated no_of_votes and gross values are also skewed. To accurately predict an arbitrary movie’s revenue from these predictors, a model would need to be trained on a much broader and more varied dataset, with many more observations. For the limited data that was available, however, the finished model performed well.